The course is about learning to use R for statistics, with an emphasis on using it together with GitHub. My objective for attending the course is to refresh my knowledge of R and statistics in general. The course has been on the “TODO” list of my PhD studies for a while.
I think the R for Health Data Science book works well as an introduction, although, as I’m not a complete novice, I didn’t read the chapters very thoroughly - there’s a lot to take in in a short time (I missed the exercises in chapter 3, but not chapter 4). Perhaps the book works as a sampler now and as a reference later.
There have been quite a few new things since I last used R, like the
tidyverse pipe. I’m glad it is possible to use the
“language-agnostic standard” of = for assignment
instead of (or in addition to) <-, which I’ve always
found odd. A lot of things are similar to features in pandas for Python
data analysis (not surprising, as pandas has taken quite a bit of
inspiration from R).
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sat Dec 2 21:11:07 2023"
GitHub repository: https://github.com/dustedmtl/IODS-project
Data analysis
date()
## [1] "Sat Dec 2 21:11:07 2023"
The dataset that is read in contains averaged survey scores for three learning approaches (deep, surface and strategic), as well as student attitude, exam points, age, and gender. There are 166 students.
lr2 <- read.table('data/learning2014.csv', sep = ',', header = TRUE)
dim(lr2)
## [1] 166 7
summary(lr2)
## gender age attitude points
## Length:166 Min. :17.00 Min. :14.00 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Mode :character Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p <- ggpairs(lr2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The score variables deep, surf and stra are distributed between 1 and 5.
Correlations in the data:
- points and attitude are strongly correlated
- points and stra are also correlated; likewise points and surf (negative correlation)
- age skews young and is not strongly correlated with anything
Linear regression is used to determine the relation between a response variable and one or more (in the case of multiple regression) explanatory variables. For a single variable, the equation to solve is:
\(Y = \alpha + \beta X + \epsilon\)
where all the terms are \(N\)-dimensional vectors (for \(N\) observations). \(Y\) is the response variable and \(X\) is the explanatory variable. The error term \(\epsilon\) is assumed to be normally distributed with a mean of 0.
The task is to find values of \(\alpha\) and \(\beta\) that minimize the sum of squared errors \(\sum_{i=1}^{N} \epsilon_i^2\). This is called the least squares method.
For multiple linear regression the equation is similarly \(Y = \alpha + X\beta + \epsilon\), where \(X\) is a matrix of dimensions \(N \times k\) (with \(k\) explanatory variables) instead of an \(N \times 1\) vector.
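To make the least squares idea concrete, here is a minimal sketch (with synthetic data, not the course dataset) that solves the normal equations directly and compares the result with lm():

```r
# Minimal sketch with synthetic data (not the course dataset):
# solve the least squares problem via the normal equations
# (X'X) b = X'y and compare with lm().
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

X <- cbind(1, x)                     # design matrix with an intercept column
b <- solve(t(X) %*% X, t(X) %*% y)   # least squares estimates of alpha and beta

fit <- lm(y ~ x)
all.equal(as.vector(b), unname(coef(fit)))  # TRUE (up to numerical tolerance)
```

In practice lm() uses a QR decomposition rather than the normal equations, but the estimates are the same.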
Linear regression model for variable points based on three explanatory variables: attitude, stra and surf:
model <- lm(points ~ attitude + stra + surf, data = lr2)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lr2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The overall model fit is assessed with an omnibus F-test, which tests the null hypothesis that the coefficients of all three explanatory variables are zero. The very low p-value of the F-statistic (3.156e-08) means that this hypothesis is not likely to be true.
The p-value for attitude (far below 0.05) shows that its coefficient is statistically significant; the coefficients for stra and surf are not significant at the 0.05 level.
The multiple R-squared value of 0.2074 means that the three explanatory variables account for about 21 % of the variation in exam points.
# Three diagnostic plots in the same figure
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
Above are three diagnostic plots of the residuals, where a residual is the difference between the actual and the fitted value.
The first plot (residuals vs fitted) is symmetrical and shows no relationship between residuals and fitted values, so a linear regression model is appropriate here.
The second plot (normal Q-Q) plots the residual quantiles against theoretical normal quantiles. The points fall roughly on a straight line, so the normality assumption of the errors holds.
The last plot (residuals vs leverage) uses standardized residuals, which are scaled to equal variance. The pattern is similar to plot 1.
date()
## [1] "Sat Dec 2 21:11:09 2023"
In the previous section we covered linear regression. Here we consider some cases where its use may be inappropriate. In particular, linear regression relies on the response variable being normally distributed (conditional on the predictors), and the response might not even be continuous. In the case of a binary response variable we would prefer to use logistic regression:
\(\log\left(\frac{p}{1-p}\right) = \alpha + \beta X\)
where \(p\) is the probability of one outcome and \(1-p\) the probability of the other, with \(\beta\) and \(X\) being the coefficient and variable vectors for the explanatory variables; \(\beta X\) denotes their dot product \(\sum_j \beta_j X_j\).
Here \(\frac{p}{1-p}\) is the odds of one outcome against the other (the benefit of modelling log-odds instead of probabilities is that their relationship to the explanatory variables is linear, i.e. the fitted lines are straight).
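As a quick sketch of the link function, R’s built-in qlogis() and plogis() convert between probabilities and log-odds:

```r
# Sketch: qlogis() computes the log-odds log(p / (1 - p)),
# and plogis() is its inverse (the logistic function).
p <- 0.8
log_odds <- qlogis(p)    # log(0.8 / 0.2) = log(4)
plogis(log_odds)         # back to 0.8
# Consequence: a unit increase in an explanatory variable changes the
# log-odds by beta, i.e. multiplies the odds by exp(beta).
```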
The data set for analysis comes from studies about Portuguese students’ alcohol consumption:
http://www.archive.ics.uci.edu/dataset/320/student+performance
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc <- read.table('data/alc_data.csv', sep=',', header = T)
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
The data has been combined from mathematics and Portuguese classes.
The data has a number of variables related to grading (G1, G2 and G3), absences, extra classes (paid) and previous failures (failures). It also has information about alcohol usage, including a logical column for high alcohol usage (high_use). The rest are a variety of numerical or categorical metadata columns.
The objective is to analyze the relation of alcohol consumption to the various other variables. The hypothesis is that high alcohol consumption will be associated with:
- lower grades (especially the final grade G3)
- more absences
- less time spent studying (studytime)
- not having taken paid extra classes (paid)
library(tidyr); library(dplyr); library(ggplot2)
# install.packages("patchwork")
library(patchwork)
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 <- g1 + geom_boxplot() + ylab("final grade")
g11 <- ggplot(alc, aes(x = sex, y = G3, col = high_use))
g11 <- g11 + geom_boxplot() + ylab("final grade")
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 <- g2 + geom_boxplot() + ylab("absences")
g3 <- ggplot(alc, aes(x = studytime, y = high_use))
g3 <- g3 + geom_col() + ylab("high use")
g4 <- ggplot(alc, aes(x = paid, y = high_use))
g4 <- g4 + geom_boxplot() + ylab("high use")
g1 + g11 + g2 + g3
Grades are lower and absences more numerous with high alcohol usage, although the effect is more pronounced for men than for women. Students with high alcohol consumption also spent much less time studying.
gh <- ggplot(alc, aes(x = studytime))
gh <- gh + geom_histogram()
gh2 <- ggplot(alc, aes(x = studytime))
gh2 <- gh2 + geom_histogram()
gh + gh2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
table(high_use = alc$high_use, paid = alc$paid)
## paid
## high_use no yes
## FALSE 140 119
## TRUE 56 55
A slightly smaller proportion of students with low alcohol consumption took paid classes, although the difference is not dramatic. My hypothesis was thus incorrect; it’s possible that students who use a lot of alcohol need the extra classes (perhaps to compensate for the lack of study time?).
m <- glm(high_use ~ G3 + absences + studytime + paid, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + studytime + paid, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.44301 0.51534 0.860 0.389984
## G3 -0.06369 0.03725 -1.710 0.087317 .
## absences 0.07688 0.02285 3.365 0.000766 ***
## studytime -0.58041 0.16270 -3.567 0.000361 ***
## paidyes 0.40796 0.24619 1.657 0.097501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 414.63 on 365 degrees of freedom
## AIC: 424.63
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) G3 absences studytime paidyes
## 0.44300723 -0.06368689 0.07687909 -0.58041234 0.40795610
The low p-values for absences and studytime indicate a statistically significant association with high alcohol use. Surprisingly (?), grades do not.
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 1.5573836 0.5663598 4.305756
## G3 0.9382987 0.8717178 1.009274
## absences 1.0799115 1.0350021 1.132305
## studytime 0.5596675 0.4024071 0.762707
## paidyes 1.5037411 0.9304002 2.446804
For grades (G3) and paid classes, 1.0 is within the confidence interval, thus there is no evidence that these variables are associated with high alcohol usage.
m2 <- glm(high_use ~ absences + studytime, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
# add probability
alc2 <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 93 18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67027027 0.02972973 0.70000000
## TRUE 0.25135135 0.04864865 0.30000000
## Sum 0.92162162 0.07837838 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$prediction)
## [1] 0.2810811
The proportion of incorrect predictions (training error) is 28.1 %.
table(high_use = alc2$high_use, high_absences = alc2$absences > 1) %>% prop.table() %>% addmargins()
## high_absences
## high_use FALSE TRUE Sum
## FALSE 0.23513514 0.46486486 0.70000000
## TRUE 0.07027027 0.22972973 0.30000000
## Sum 0.30540541 0.69459459 1.00000000
table(high_use = alc2$high_use, low_study = alc2$studytime < 2) %>% prop.table() %>% addmargins()
## low_study
## high_use FALSE TRUE Sum
## FALSE 0.5486486 0.1513514 0.7000000
## TRUE 0.1864865 0.1135135 0.3000000
## Sum 0.7351351 0.2648649 1.0000000
table(high_use = alc2$high_use, low_study_high_absence = alc2$studytime < 2 | alc2$absences > 1) %>% prop.table() %>% addmargins()
## low_study_high_absence
## high_use FALSE TRUE Sum
## FALSE 0.19189189 0.50810811 0.70000000
## TRUE 0.05405405 0.24594595 0.30000000
## Sum 0.24594595 0.75405405 1.00000000
Guessing based on simple rules such as low study time or high absences does not yield good predictions.
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2837838
The 10-fold cross-validation error (about 28.4 %) is essentially the same as the training error. This is not unexpected: cross-validation estimates out-of-sample performance rather than improving the model, so similar figures suggest the model is not overfitting.
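What cv.glm() does can be sketched by hand. The following is a rough illustration on synthetic data (not the course dataset): split the rows into K folds, fit on K-1 folds, evaluate the loss on the held-out fold, and average the fold errors.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x))   # synthetic binary response
d <- data.frame(x, y)

K <- 10
folds <- sample(rep(1:K, length.out = n))   # random fold assignment
errs <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = d[folds != k, ], family = "binomial")
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((p > 0.5) != d$y[folds == k])        # misclassification on fold k
})
mean(errs)   # cross-validated error estimate
```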
date()
## [1] "Sat Dec 2 21:11:10 2023"
In multivariate analysis several variables are analysed at the same time. The difference between analysing a single response variable and multivariate analysis is that in the latter you do not expect to “explain” a single variable in terms of the others, but rather (often) explore connections between the variables.
It is useful to explore the relations between variables graphically, for example using boxplots and scatterplots. Normality of the data can be assessed with quantile (Q-Q) plots.
The purpose of clustering is to group items into mutually exclusive groups. The members of a single group should be closer to each other than to the members of other groups.
A common metric for determining the difference between items is the Euclidean distance:
\(d_{ij} = \sqrt{\sum_{k=1}^{q}(x_{ik}-x_{jk})^2}\)
Another option is the Manhattan distance, which assumes “block-wise” traversal, i.e. you can only move east/west or north/south along the streets.
The distance metrics are obviously sensitive to differences in scale; for them to work, the data may have to be centered around the mean and normalized by the standard deviation:
\(\mathrm{scaled}(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\)
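As a sketch (with a made-up matrix), standardizing by hand gives the same result as R’s scale():

```r
# Standardize each column: subtract the column mean, divide by the sd.
m <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 40, 80))
manual <- apply(m, 2, function(x) (x - mean(x)) / sd(x))
all.equal(manual, scale(m), check.attributes = FALSE)  # TRUE
colMeans(manual)  # approximately 0 for each column
```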
In hierarchical clustering, items are iteratively joined to their closest clusters (either single items or clusters formed in earlier steps). There are three common ways to measure the distance between clusters:
- single linkage
- maximum (complete) linkage
- average linkage
In single linkage, the distance between clusters is the smallest distance between any members of the respective clusters. Maximum and average linkage are based on the maximum and average distance, respectively. Hierarchical clustering produces a tree-like dendrogram.
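Hierarchical clustering is not demonstrated elsewhere in this chapter, so here is a small sketch on synthetic data showing the three linkage methods (hclust calls maximum linkage “complete”):

```r
set.seed(7)
# two well-separated 2-D blobs of 10 points each
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2))
d <- dist(pts)   # Euclidean distance matrix

hc_single  <- hclust(d, method = "single")    # closest pair of members
hc_max     <- hclust(d, method = "complete")  # farthest pair of members
hc_average <- hclust(d, method = "average")   # average pairwise distance

# plot(hc_average) would draw the dendrogram; cutree() extracts
# a flat clustering with k groups:
cutree(hc_average, k = 2)
```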
The k-means algorithm seeks to group items into \(k\) clusters based on some criterion, typically a distance metric (over all items). The issue with this method is that for a large amount of data, computing all the distances may be computationally infeasible. An alternative is to choose some initial clustering and then iteratively make small changes, keeping only those changes that improve whatever criterion is used to measure the “best fit”.
We are analysing the Boston housing market data and its relation to various variables. The details of the data set can be found here:
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
## corrplot 0.92 loaded
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
In addition to the median house price, the variables include crime rate, taxes, the proportion of the Black population, and access to education, employment and transportation, among other things.
Renaming the variables to be more descriptive:
names(Boston) = c("crime", "zone.pct", "ind.pct",
"river", "nox.ppm",
"rooms.avg", "age.pct",
"dist", "roads.idx", "tax",
"pupil.ratio", "black",
"stat.pct", "price")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crime : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zone.pct : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ ind.pct : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ river : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox.ppm : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rooms.avg : num 6.58 6.42 7.18 7 7.15 ...
## $ age.pct : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dist : num 4.09 4.97 4.97 6.06 6.06 ...
## $ roads.idx : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ pupil.ratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ stat.pct : num 4.98 9.14 4.03 2.94 5.33 ...
## $ price : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
## crime zone.pct ind.pct river nox.ppm rooms.avg age.pct dist roads.idx tax
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222
## pupil.ratio black stat.pct price
## 1 15.3 396.90 4.98 24.0
## 2 17.8 396.90 9.14 21.6
## 3 17.8 392.83 4.03 34.7
## 4 18.7 394.63 2.94 33.4
## 5 18.7 396.90 5.33 36.2
## 6 18.7 394.12 5.21 28.7
The ranges of the variables vary; some are percentages, some are ratios, and the rest take a variety of positive values (mainly above 1).
# round to two decimals (round() without digits would truncate to integers)
cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix,
method="circle", type = "upper",
cl.pos = "b",
tl.pos = "d", tl.cex = 0.6)
The correlation matrix shows positive correlations between crime and taxation and road access. House price is positively correlated with the average number of rooms (there is no variable for house size itself) and negatively correlated with the pupil-teacher ratio (i.e. fewer teachers per pupil) and the share of lower-status population.
Access to the river doesn’t seem to be correlated with anything.
To be able to properly analyze the data, we need to scale it.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- as.numeric(boston_scaled$crime)
summary(boston_scaled$crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
We create a categorical variable for the crime rates.
bins <- quantile(boston_scaled$crime)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crime,
breaks = bins,
include.lowest = TRUE,
labels = c("low", "med_low", "med_high", "high")
)
boston_scaled <- dplyr::select(boston_scaled, -crime)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
## zone.pct ind.pct river nox.ppm rooms.avg age.pct dist
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
## roads.idx tax pupil.ratio black stat.pct price crime
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278 low
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239 low
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375 low
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886 low
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323 low
## 6 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582 low
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Using crime as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The newly scaled variables have a mean of zero.
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
LDA is used to reduce the dimensionality of the data. Here we want to see which variables are most relevant to crime rates.
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2623762 0.2400990 0.2574257
##
## Group means:
## zone.pct ind.pct river nox.ppm rooms.avg age.pct
## low 0.91775959 -0.9238915 -0.06938576 -0.8904449 0.48247396 -0.9283360
## med_low -0.07626688 -0.2732250 -0.01233188 -0.5560333 -0.13335915 -0.3459839
## med_high -0.37054365 0.1935858 0.17414622 0.4229112 0.08587135 0.4154769
## high -0.48724019 1.0170690 -0.04518867 1.0562128 -0.46857744 0.7910589
## dist roads.idx tax pupil.ratio black stat.pct
## low 0.8480576 -0.6811386 -0.7699980 -0.46088995 0.3814883 -0.79318701
## med_low 0.3755945 -0.5452370 -0.4631497 -0.05702597 0.3186495 -0.12252381
## med_high -0.3926630 -0.4265815 -0.3140428 -0.26517540 0.1230960 0.06258688
## high -0.8516574 1.6386213 1.5144083 0.78135074 -0.8337407 0.87681080
## price
## low 0.58660081
## med_low -0.01259366
## med_high 0.17107333
## high -0.71302948
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zone.pct 0.09094912 0.61394151 -0.96260219
## ind.pct 0.06159580 -0.22926314 0.32792122
## river -0.04264060 0.01114121 0.14223408
## nox.ppm 0.41785090 -0.82427460 -1.30159773
## rooms.avg 0.05728181 -0.09630495 -0.21981881
## age.pct 0.20276326 -0.35374143 0.03272508
## dist -0.08467857 -0.22581145 0.34363845
## roads.idx 3.39719304 0.92744987 -0.19568163
## tax -0.03266278 0.05292419 0.64585704
## pupil.ratio 0.11938716 -0.05668705 -0.29630210
## black -0.10379200 0.01316014 0.09942475
## stat.pct 0.16053935 -0.21375288 0.25929034
## price 0.04766577 -0.40803990 -0.24767531
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9539 0.0341 0.0121
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
The first discriminant explains about 95 % of the between-group variance (proportion of trace). The variable with the highest coefficient for it is roads.idx, i.e. the index of accessibility to radial highways, followed by nitrogen oxide pollution, the percentage of old houses and the share of lower-status population.
# target classes as numeric
classes <- as.numeric(train$crime)
plot(lda.fit,
dimen = 2,
col = classes,
pch = classes)
lda.arrows(lda.fit, myscale = 2)
The plot confirms that road access has the most relevance to crime rates.
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 11 3 0
## med_low 2 17 1 0
## med_high 0 12 15 2
## high 0 0 0 23
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 0.156862745 0.107843137 0.029411765 0.000000000 0.294117647
## med_low 0.019607843 0.166666667 0.009803922 0.000000000 0.196078431
## med_high 0.000000000 0.117647059 0.147058824 0.019607843 0.284313725
## high 0.000000000 0.000000000 0.000000000 0.225490196 0.225490196
## Sum 0.176470588 0.392156863 0.186274510 0.245098039 1.000000000
The method correctly classifies about 69.6 % of the items.
sum(correct_classes == lda.pred$class) / length(correct_classes)
## [1] 0.6960784
# center and standardize variables
boston_scaled2 <- scale(Boston)
colnames(boston_scaled2) = c("crime", "zone.pct", "ind.pct",
"river", "nox.ppm",
"rooms.avg", "age.pct",
"dist", "roads.idx", "tax",
"pupil.ratio", "black",
"stat.pct", "price")
# change the object to data frame
boston_scaled2 = as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Above are summaries of the Euclidean and Manhattan distances for the scaled dataset; the Manhattan distances are larger, as expected, since they sum absolute coordinate differences.
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
km2 <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km2$cluster)
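The number of clusters above (3) was chosen arbitrarily. A common heuristic is the “elbow” method: compute the total within-cluster sum of squares (WCSS) for a range of k and look for the point where the curve stops dropping sharply. A sketch on the scaled Boston data:

```r
library(MASS)    # Boston data

set.seed(123)    # kmeans uses random starting centers
twcss <- sapply(1:10, function(k) {
  kmeans(scale(Boston), centers = k, nstart = 10)$tot.withinss
})
plot(1:10, twcss, type = "b", xlab = "number of clusters k",
     ylab = "total WCSS")
```

The nstart argument reruns the algorithm from several random starts and keeps the best result, which makes the curve more stable.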
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d', mode='markers',
color=train$crime,
)
The plot shows some clustering according to the crime categories.
date()
## [1] "Sat Dec 2 21:11:13 2023"
PCA is a method for transforming a set of correlated variables into a set of uncorrelated, ordered variables: \(x_1, \ldots, x_q \to y_1, \ldots, y_q\). Typically only the first few components are retained, reducing the dimensionality.
The first component \(y_1\) is the one with the highest sample variance. Written in terms of coefficients, \(y_1 = \sum_{i=1}^q a_{1i} x_i\) with the restriction that the sum of squares of the coefficients is 1, that is \(\sum_{i=1}^q a_{1i}^2 = 1\). The remaining components are determined in the same way, with the additional restriction that they are uncorrelated with the earlier ones, e.g. for the second component \(\sum_{i=1}^q a_{1i} a_{2i} = 0\).
The coefficients can be extracted from the covariance or correlation matrices.
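A sketch of that equivalence (on random data): the loadings computed by prcomp(), which uses SVD, match the eigenvectors of the sample covariance matrix up to sign, and the component variances match the eigenvalues.

```r
set.seed(3)
X <- matrix(rnorm(100), ncol = 4)   # 25 observations, 4 variables

ev  <- eigen(cov(X))   # eigendecomposition of the covariance matrix
pca <- prcomp(X)       # PCA via SVD (centers the data by default)

all.equal(ev$values, unname(pca$sdev^2))               # TRUE
all.equal(abs(ev$vectors), abs(unname(pca$rotation)))  # TRUE (up to sign)
```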
First of all, reducing the dimensions allows for easier analysis. PCA is especially useful for exploration, as we shall see.
For categorical variables, we can use correspondence analysis instead. In practical terms, the data is converted to a correspondence matrix, which is then decomposed using singular value decomposition (SVD).
The data set for analysis is originally from the UN Development Programme.
The data set contains variables related to Human Development Index (HDI) and Gender Inequality Index (GII).
The HDI calculation takes into account:
- life expectancy
- expected amount of education
- per capita GNI
The GII calculation takes into account:
- maternal mortality
- female empowerment
- labour participation rates
The data has been massaged to contain only a small set of variables.
library(corrplot)
library(tibble)
library(readr)
human <- read_csv("data/human_data_2.csv", show_col_types = FALSE)
human_ <- column_to_rownames(human, "Country")
summary(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
str(human_)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The variables that remain are:
Female/male ratio of the population with at least secondary education (Edu2.FM). Values above 1 mean that relatively more females than males have secondary education.
Female/male ratio of labour force participation (Labo.FM)
Life Expectancy at birth (Life.Exp)
Expected Years of Education (Edu.Exp)
Gross National Income per capita (GNI)
Maternal Mortality Ratio (Mat.Mor). Maternal deaths per 100,000 live births.
Adolescent Birth Rate (Ado.Birth)
Share of female parliamentary representation (percentage) (Parli.F)
# Access GGally
library(GGally)
# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)
# Access corrplot
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot()
The two plots visualize the same correlation information in different ways. In the latter plot, the strength of the correlation is shown with the size of the circle and colors indicate sign of correlation (deep red = negative, deep blue = positive).
The labour market participation ratio by gender does not seem to correlate with anything. Maternal mortality is strongly negatively correlated with life expectancy, the female education ratio and expected education, and positively correlated with adolescent birth rate; the same pattern holds for adolescent birth rate. The strong correlations can also be seen in the scatterplots of the first plot, alongside other obvious ones, such as life expectancy with expected education.
Some variables have a roughly normal-looking distribution (I am not going to run a test to verify this); others, like GNI and maternal mortality, are skewed with long tails: both have a large number of countries at the low end.
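Assuming the human_ data frame from above, the strongest correlations can also be listed programmatically (the 0.7 cutoff is an arbitrary choice for illustration):

```r
# List variable pairs with |r| > 0.7, keeping each pair once
cm <- cor(human_)
cm[upper.tri(cm, diag = TRUE)] <- NA        # drop self- and duplicate pairs
idx <- which(abs(cm) > 0.7, arr.ind = TRUE) # strong correlations only
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           r    = round(cm[idx], 2))
```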
library(tibble)
pca_human <- prcomp(human_)
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(s$importance[2, ], digits = 5) * 100
# print out the percentages of variance
pc_lab <- print(pca_pr)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
#paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"),
xlab = pc_lab[1],
ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
In the unscaled case the first component explains 99.99% of the variance, and the biplot is not useful: GNI dominates the decomposition because its values, and hence its variance, are by far the largest.
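The dominance of GNI can be verified from the column variances (assuming human_ from above):

```r
# GNI's variance dwarfs the other variables', so it dominates unscaled PCA
sort(apply(human_, 2, var), decreasing = TRUE)
```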
human_std <- scale(human_)
pca_human_scaled <- prcomp(human_std)
# create and print out a summary of pca_human
s_scaled <- summary(pca_human_scaled)
# rounded percentages of variance captured by each PC
pca_pr_scaled <- round(s_scaled$importance[2, ], digits = 5) * 100
# print out the percentages of variance
pc_lab_scaled <- print(pca_pr_scaled)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.605 16.237 9.571 7.583 5.477 3.595 2.634 1.298
# paste0(names(pca_pr_scaled), " (", pca_pr_scaled, "%)")
# draw a biplot
biplot(pca_human_scaled,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"),
xlab = pc_lab_scaled[1],
ylab = pc_lab_scaled[2])
Scaling the data is quite important here, as the variables live on very different scales (ratios around 1, life expectancies in the tens, GNI in the thousands); without scaling, the variables with the largest variance dominate. The pink arrows show the direction and strength of each variable's contribution to the components.
In the scaled case, the first two components account for around 70% of the variance. The second component seems to index the share of female parliamentary participation and the labour market participation ratio, and the first component the rest (either positively or negatively).
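This interpretation can be checked against the loadings of the scaled PCA (assuming pca_human_scaled from above): Parli.F and Labo.FM should load mainly on PC2, the remaining variables on PC1.

```r
# Loadings (rotation matrix) on the first two principal components
round(pca_human_scaled$rotation[, 1:2], 2)
```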
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
The tea_time data set contains 300 responses from a questionnaire about tea consumption habits.
head(tea_time)
## Tea How how sugar where lunch
## 1 black alone tea bag sugar chain store Not.lunch
## 2 black milk tea bag No.sugar chain store Not.lunch
## 3 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 4 Earl Grey alone tea bag sugar chain store Not.lunch
## 5 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 6 Earl Grey alone tea bag No.sugar chain store Not.lunch
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
facet_wrap("name", scales = "free")
Clearly tea is a leisure-time drink: it is mostly not drunk at lunch. Most respondents drink it plain, without lemon or milk, and buy tea bags from a chain store.
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The first dimension seems to mostly index the variables how (tea bag vs. unpackaged) and where (chain store vs. tea shop), as the eta² values show. This is in line with the bar plots above.
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
From the MCA plot, it would appear that one goes to a tea shop to buy unpackaged (loose) tea, presumably for a tastier drink.
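The association can also be read off the category coordinates (assuming the mca object from above): categories that point in the same direction along a dimension, such as unpackaged and tea shop, tend to co-occur.

```r
# Category coordinates on the first two MCA dimensions
round(mca$var$coord[, 1:2], 2)
```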
(more chapters to be added similarly as we proceed with the course!)